-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remote: support saving RepoTree objects directly to cache #3825
Conversation
for fname in tree.walk_files(path_info): | ||
checksum = tree.get_file_checksum(fname) | ||
file_info = { | ||
self.PARAM_CHECKSUM: checksum, | ||
self.PARAM_RELPATH: fname.relative_to(path_info).as_posix(), | ||
} | ||
self._save_file(fname, checksum, tree=tree) | ||
dir_info.append(file_info) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual behavior for this walk
depends on how tree
was initialized.
Git files will always be saved directly from the tree in _save_file()
via tree.open()
For DVC outs:
- If
fetch=False
andstream=False
,walk
will not recurse into DVC out directories - If
fetch=True
, DVC outs will be fetched whenwalk
recurses into the out directory_save_file
will skip thetree.open()
->copy for fetched outs (since they will already exist in cache)
- If
stream=True
, DVC outs will be saved by streaming in_save_file()
viatree.open()
Basically, for erepo's and import/get, we probably always want the behavior from fetch=True
.
if tree: | ||
checksum = self._save_tree(path_info, tree) | ||
else: | ||
dir_info = self.get_dir_cache(checksum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've discussed this before, but just to clarify: I suppose we need this because we are not ready to make _collect_dir
use the tree, just yet, right? Because it uses self.walk_files
, which raises another question of abstracting tree-like Remote methods somehow (maybe in RemoteTree's or something)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can add tree support in _collect_dir
, but it made the workflow look a bit strange to me, since _collect_dir
is used in get_dir_checksum
. If you go off of the current behavior for fetch
ing external git files, we would do something along the lines of:
# get a dir checksum for the git dir via `get_dir_checksum()`/`_collect_dir()`/`tree.walk_files()`
dir_cache_info = cache.save_info(path, tree=tree)
# walk the tree again and save git file objects from tree
cache.save(path, dir_cache_info, tree=tree)
But walking the tree twice like this seems unnecessary for this use case, since we can save objects from the tree and collect the dir cache info at the same time during a single walk, and I wasn't sure if saving objects from tree during the walk would be desired behavior for _collect_dir
(because saving objects in a get_dir_checksum
call definitely seems like non-desired behavior).
Thinking about it now, what we could do is add the tree
param to _collect_dir
, and if it's called with a tree then we just save the objects from tree during walk_files
, and use _collect_dir
here in place of _save_tree
. And when _collect_dir
is used w/o a tree param, it doesn't save any objects during walk_files
and behaves the same way as before for get_dir_checksum
calls.
if tree: | ||
isdir = tree.isdir | ||
save_link = False | ||
else: | ||
isdir = self.isdir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how RemoteTree(or something like that) becomes more and more apparent π
And that the state should probably belong to the tree. So here it would be a noop for git tree.
Not saying we should implement all of that right now, just noticing the things that we've discussed earlier π
if self.changed_cache(checksum): | ||
self.move(path_info, cache_info, mode=self.CACHE_MODE) | ||
self.link(cache_info, path_info) | ||
elif self.iscopy(path_info) and self._cache_is_copy(path_info): | ||
# Default relink procedure involves unneeded copy | ||
self.unprotect(path_info) | ||
else: | ||
self.remove(path_info) | ||
self.link(cache_info, path_info) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also think about making save()
only save without checking out(these link
s are effectively checkout
). This if&else
makes it clear that this method tries to do too much. But at the same time move
is a very nice optimization for big files that allows us to make it happen instantly if we are within the same fs.
I would be fine with keeping it as is for now, but let's at least keep this in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Let's merge for now and proceed with the next steps. From the discussions above, it is clear that we will likely abstract away more things when working with trees in remotes.
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
β I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible. π
Related to #3811.
added support for:
RepoTree
/DvcTree.open()
(Will close api: support granularity (files inside dirs)Β #3541)RepoTree.open()
(Repo.open_by_relpath()
,api.open()
)fetch=True
orstream=True
ex:
For saving tree paths, rather than requiring the call to
cache.save_info(...)
to get a destination checksum, for tree paths the checksum will be calculated duringsave()
. The reason for this is that for directories, it makes more sense to collect dir cache information as we walk the tree (and save files from the tree), rather than walking the tree once to collect dir cache information, and then walking it a second time to save files.